In [1]:
import pandas as pd
import numpy as np
from collections import Counter

This notebook explains the theory behind microbial source tracking using a Gibbs sampler.

The theory and examples are based on work detailed in Knights et al. 2011; if you find this work helpful, please consider citing Dan Knights' paper.

Note:

The formula for calculating the probability that a held-out sequence will be assigned to a given environment is reported incorrectly in Knights et al. 2011. The corrected formula is:

$$P( z_{i} = v \mid z^{ \neg i}, x) \propto P(x_{i} \mid v) \times P(v \mid x^{ \neg i}) = \begin{cases} \frac{m_{x_{i}v} + \alpha_{1}}{m_{v} + \tau \alpha_{1}} \times \frac{n_{v}^{ \neg i} + \beta}{n - 1 + \beta V} & v < V \\ \\ \frac{m_{x_{i}v} + n \alpha_{2}}{m_{v} + \tau n \alpha_{2}} \times \frac{n_{v}^{ \neg i} + \beta}{n - 1 + \beta V} & v = V \end{cases}$$

This updated formula is what is actually calculated by both the R SourceTracker implementation and this repository (personal communication with Dan Knights).
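
As a concrete reference, here is a minimal sketch of the corrected formula in code. This is illustrative only, not the repository's implementation: the function name and the array names (m, unknown, n_v_not_i) are invented for this sketch, while tau, n, and V are the number of taxa, sink sequences, and environments (including the Unknown), matching the symbols above.


In [ ]:
def env_probabilities(taxon, m, unknown, n_v_not_i, n, alpha1, alpha2, beta):
    # m          : (num_known_sources x num_taxa) array of source counts
    # unknown    : length-num_taxa count vector for the Unknown environment
    # n_v_not_i  : per-environment counts of sink assignments, excluding
    #              the held-out sequence i
    tau = m.shape[1]    # number of taxa
    V = m.shape[0] + 1  # known sources plus the Unknown
    # p_tv for the known sources (v < V)
    p_tv_known = (m[:, taxon] + alpha1) / (m.sum(axis=1) + tau * alpha1)
    # p_tv for the Unknown source (v = V)
    p_tv_unknown = (unknown[taxon] + n * alpha2) / \
                   (unknown.sum() + tau * n * alpha2)
    p_tv = np.append(p_tv_known, p_tv_unknown)
    # p_v is shared by all environments
    p_v = (n_v_not_i + beta) / float(n - 1 + beta * V)
    joint = p_tv * p_v
    return joint / joint.sum()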

Problem

Given a number of sources, determine the contribution of each source to a sink.

The Gibbs sampler works in four basic steps (a schematic sketch follows the list):

1. Randomly assign the sequences in a sink to source environments. These random assignments represent which source a given sink sequence came from.
2. Select one of the sequences from step 1, calculate the *actual* probabilities of that sequence having come from any of the source environments, and update the assigned source environment of the sequence based on a random draw with the calculated probabilities. Repeat many times. 
3. At intervals in the repeats of step 2, take the source environment assignments of all the sequences in a sink and record them. 
4. After doing step 3 a certain number of times (i.e. recording the full assignments of source environments for each sink sequence), terminate the iteration and move to the next sink.  
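
A schematic of those four steps for a single sink might look like the following sketch. It is not the repository's implementation: prob_fn stands in for the full conditional probability calculation (for instance the env_probabilities sketch above), and the parameter names are illustrative.


In [ ]:
def gibbs_single_sink(sink_taxa, num_envs, prob_fn, burnin, delay, num_draws):
    n = len(sink_taxa)
    # Step 1: randomly assign each sink sequence to a source environment
    assignments = np.random.randint(num_envs, size=n)
    draws = []
    iteration = 0
    while len(draws) < num_draws:
        # Step 2: pick a sequence, compute its actual environment
        # probabilities, and redraw its assignment
        i = np.random.randint(n)
        probs = prob_fn(i, sink_taxa, assignments)
        assignments[i] = np.random.choice(num_envs, p=probs)
        iteration += 1
        # Step 3: at intervals, record the full set of assignments
        if iteration > burnin and iteration % delay == 0:
            draws.append(np.bincount(assignments, minlength=num_envs) /
                         float(n))
    # Step 4: after enough recorded draws, stop and report the average
    return np.mean(draws, axis=0)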

Machinery

There are two probabilities that form the basis of the Gibbs calculation, namely:

$P(t \mid v)$ = probability of a source emitting a sequence from a specific taxon

$P(v)$ = probability of a sequence having come from a given source

$P(t \mid v)$

$P(t \mid v)$, referenced in the code as p_tv, is the probability of a source emitting a sequence from a specific taxon. Said another way, this is the probability of seeing a sequence from taxon $t$ produced by source $v$. Let's look at an example table:


In [20]:
table = pd.DataFrame(data=[[0, 10, 100], [3, 6, 9]], 
                     columns=['F1', 'F2', 'F3'], 
                     index=['Source1', 'Source2'])
table


Out[20]:
F1 F2 F3
Source1 0 10 100
Source2 3 6 9

Imagine that the above table represents a pair of urns (Source1 and Source2) and that each urn contains billiard balls of 3 colors (F1 through F3). Now, let's say you close your eyes and randomly select an urn, say Source2. You reach into that urn and withdraw a random billiard ball. What is the probability that you have selected a ball of color F1?

We can make this conditional probability calculation very simply by dividing each feature count within a source by the total number of sequences seen in that source; this is just the relative abundance of each feature within each source.


In [26]:
# normalize each source (row) to relative abundances
table = table.apply(lambda row: row / row.sum(), axis=1)
table


Out[26]:
F1 F2 F3
Source1 0.000000 0.090909 0.909091
Source2 0.166667 0.333333 0.500000

Now we can calculate $P(t \mid v)$. Let's imagine that we saw F1 in our sink. What is the probability that F1 came from Source1 versus Source2? From the table above we can see that F1 was never observed in Source1. To turn these relative abundances into per-feature probabilities, we divide each column by its sum:


In [28]:
# Calculate the probability (p_tv) for each OTU
table_p_tv = table.loc[['Source1', 'Source2']].div(table.sum(axis=0), axis=1)
table_p_tv


Out[28]:
F1 F2 F3
Source1 0.0 0.214286 0.645161
Source2 1.0 0.785714 0.354839

This table shows $P(t \mid v)$ for each $t, v$ (i.e. each feature and source).

To check our understanding, we can think about the values in this matrix. If we see F1 in the sink sample, it could only have come from Source2, and therefore the p_tv for F1 in Source2 is 100%. If F3 was seen in our sink, there would be a 64.5% chance of it coming from Source1 and a 35.5% chance of it coming from Source2.

The calculation for p_tv really is about this simple. We add a few tuning parameters to this calculation to ensure that more of the probability space is explored. In the biological context, the idea is that because we sequence only a small fraction of the source environment, we are likely to miss some taxa in some sources. The parameters described below help us correct for this.

Alpha1

You may note above that F1's abundance in Source1 is 0, and therefore no probability can be assigned to that feature in that source. SourceTracker2 therefore allows the user to specify an alpha1 value on the command line, which is the prior count added to each of the sample counts. This provides a small amount of probability to F1 in Source1, and the table would look like (assuming an alpha1 of 0.1):


In [44]:
table = pd.DataFrame(data=[[0.1, 10.1, 100.1], [3.1, 6.1, 9.1]], 
                     columns=['F1', 'F2', 'F3'], 
                     index=['Source1', 'Source2'])
table


Out[44]:
F1 F2 F3
Source1 0.1 10.1 100.1
Source2 3.1 6.1 9.1
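
The cell above hard-codes the result; it is equivalent to adding alpha1 to the raw counts directly:


In [ ]:
# equivalent construction: add the alpha1 prior count to every entry
raw = pd.DataFrame(data=[[0, 10, 100], [3, 6, 9]], 
                   columns=['F1', 'F2', 'F3'], 
                   index=['Source1', 'Source2'])
alpha1 = 0.1
raw + alpha1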

Unknown and Alpha2

A key feature of SourceTracker2 is the creation of an Unknown source environment. The basic premise here is to identify sequences that don't have a high probability of being derived from any of the known, specified source environments. The Unknown community is seeded with prior counts via the alpha2 parameter, which specifies a fraction of the total sink sequence count to be added for each feature in the Unknown. For example, a sink with 100 sequences and an alpha2 of 0.001 would add 100 * 0.001 = 0.1 counts to each feature, and the table would look like this:


In [45]:
table = pd.DataFrame(data=[[0.1, 10.1, 100.1], [3.1, 6.1, 9.1], [0.1, 0.1, 0.1]], 
                     columns=['F1', 'F2', 'F3'], 
                     index=['Source1', 'Source2', 'Unknown'])
table


Out[45]:
F1 F2 F3
Source1 0.1 10.1 100.1
Source2 3.1 6.1 9.1
Unknown 0.1 0.1 0.1
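
The per-feature prior count in the Unknown row is just the sink depth times alpha2:


In [ ]:
n_sink_sequences = 100
alpha2 = 0.001
n_sink_sequences * alpha2  # 0.1 prior counts per feature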

In [46]:
table = pd.DataFrame(data=[[0.1, 10.1, 100.1], [3.1, 6.1, 9.1], [0.1, 0.1, 0.1]], 
                     columns=['F1', 'F2', 'F3'], 
                     index=['Source1', 'Source2', 'Unknown'])
# include the Unknown row so each column's probabilities sum to 1
table_p_tv = table.div(table.sum(axis=0), axis=1)
table_p_tv


Out[46]:
F1 F2 F3
Source1 0.030303 0.619632 0.915828
Source2 0.939394 0.374233 0.083257
Unknown 0.030303 0.006135 0.000915

These are the precalculated p_tv values we can use.

$P(v)$

The second probability that the Gibbs sampler must calculate is $P(v)$, the probability that a sequence came from a given source. In the code, we refer to this value as p_v. In the internal loops of the Gibbs function, we randomly assign the sequences in a sink to source environments. Iteratively, we remove these sequences (one at a time) and reassign the origin of each removed sequence to an environment. p_v is the probability that one of these held-out sequences came from any one of the environments.

We'll walk through an example below. Each sink sample can be thought of as a collection of sequences, each of which identifies a taxon:


In [47]:
sink_otus = [1, 1, 1, 2, 2, 2, 2, 2, 2, 3]

Our sink sample contains 10 sequences, each assigned to one of the features in our table. Let's forget for a moment that each feature has a certain probability of being found in a given source environment (the p_tv), and instead focus on the number of possible source environments we have. If our sink sample carried no feature information at all, it would look like:


In [48]:
sink = ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X']

And if we only knew that each sink sequence must have come from one of our 3 source environments, then we would say that each sink sequence had a 1/3 probability of being assigned to each source.

Let's randomly assign each sink sequence to a source environment:


In [49]:
sink_source = [3, 1, 2, 3, 2, 3, 3, 1, 2, 1]
# where 1 = Source1, 2 = Source2, and 3 = Unknown

To calculate our p_v, we count the number of sequences assigned to each environment and divide by the total.


In [50]:
# count assignments to environments 1-3 (slot 0 of the bincount is unused)
p_v = np.bincount(sink_source, minlength=4)[1:]/float(len(sink_source))
print(p_v)


[ 0.3  0.3  0.4]

During the internal loops of the Gibbs function, we'll end up withdrawing a sequence from the sink_source vector and then recalculating the p_v. For example, let's say we are on the 5th iteration, so we withdraw the 5th sequence:


In [51]:
sink_source_copy = sink_source[:]
env_of_fifth_sequence = sink_source_copy.pop(4)
# recalculate p_v; the denominator is left as a constant because the
# combined probability vector gets renormalized before the draw anyway
p_v = np.bincount(sink_source_copy, minlength=4)[1:]/float(len(sink_source))
print(p_v)


[ 0.3  0.2  0.4]

Here we are calculating $P(v \mid x^{ \neg i})$ from the overall SourceTracker2 formula (shown at the top of this notebook). In the actual code, p_v is multiplied by p_tv to create a new probability vector for identifying where the withdrawn sequence actually came from.

As with p_tv, there are tuning parameters that we add to make the model explore a larger portion of the probability space.

Beta

Beta is added to the count of each environment to prevent any environment from having zero probability mass. In many ways it is like the alpha1 prior that we add to the source matrix data.
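
For example, using the assignment counts from above and a hypothetical beta of 10, the smoothed p_v is pulled toward the uniform distribution:


In [ ]:
beta = 10  # hypothetical value, for illustration only
counts = np.bincount([3, 1, 2, 3, 2, 3, 3, 1, 2, 1], minlength=4)[1:]
# smoothed p_v: (n_v + beta) / (n + beta * V)
(counts + beta) / float(counts.sum() + beta * len(counts))
# array([ 0.325,  0.325,  0.35 ])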

Restarts

When we restart the Gibbs sampler, we reshuffle the assignments of each sink sequence to a source environment and begin our calculations again, essentially redoing all the steps above. Each restart lets the proportions converge in an independent Markov chain. Since there is high correlation between adjacent states of a Markov chain, the restarts and delay parameters of SourceTracker2 are used to avoid correlated draws.

Sampling


In [52]:
sink_otus = [1, 1, 1, 2, 2, 2, 2, 2, 2, 3]
sink_otus_positions_shuffled = [4, 1, 8, 2, 7, 9, 0, 3, 5, 6]

The next thing we do is create a shuffled order in which to walk through our sink sequences. The list above says that the first sequence whose source environment we want to predict is the one at index 4 of sink_otus, which corresponds to F2.
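
One way to generate such an order (the hard-coded list above is one possible result):


In [ ]:
# a random visit order over the 10 sink sequences
sink_otus_positions_shuffled = list(np.random.permutation(len(sink_otus)))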


In [53]:
# For F2, taken from the table above
# for Source1, Source2, Unknown
p_tv = np.array([0.619632, 0.374233, 0.006135])
p_v = np.array([0.3, 0.3, 0.4])

combined_probability = p_tv * p_v
combined_probability / combined_probability.sum()


Out[53]:
array([ 0.61836744,  0.37346926,  0.00816331])

We have now calculated the probability that the sink sequence at index 4 belongs to each source environment and the Unknown. We can then randomly assign that sequence to one of the environments given this probability distribution. We also update our p_v by adding 1 to the count of the selected source community.
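
One way to make that draw, using the normalized probabilities from the previous cell:


In [ ]:
probs = combined_probability / combined_probability.sum()
# environments: 1 = Source1, 2 = Source2, 3 = Unknown
new_env = np.random.choice([1, 2, 3], p=probs)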

If the sequence gets assigned to the Unknown community, then the Unknown community gets an extra count of that feature in the table. In this way, the Unknown community gets populated through repeated iterations. No changes are made to the table if the sequence gets assigned to an identified source (Source1 or Source2).

Let's assume the unlikely case that our sequence did get assigned to the Unknown.


In [54]:
# Update the Unknown row in the table with the extra F2 count
table = pd.DataFrame(data=[[0.1, 10.1, 100.1], [3.1, 6.1, 9.1], [0.1, 1.1, 0.1]], 
                     columns=['F1', 'F2', 'F3'], 
                     index=['Source1', 'Source2', 'Unknown'])
table


Out[54]:
F1 F2 F3
Source1 0.1 10.1 100.1
Source2 3.1 6.1 9.1
Unknown 0.1 1.1 0.1

In [55]:
# and recalculate the p_tv values (again including the Unknown row)
table_p_tv = table.div(table.sum(axis=0), axis=1)
table_p_tv


Out[55]:
F1 F2 F3
Source1 0.030303 0.583815 0.915828
Source2 0.939394 0.352601 0.083257
Unknown 0.030303 0.063584 0.000915

In [56]:
# Update the p_v values:
# change index 4 to community 3 (the Unknown)
sink_source = [3, 1, 2, 3, 3, 3, 3, 1, 2, 1]
Counter(sink_source)


Out[56]:
Counter({1: 3, 2: 2, 3: 5})

And now:


In [57]:
sink_otus = [1, 1, 1, 2, 2, 2, 2, 2, 2, 3]
sink_otus_positions_shuffled = [4, 1, 8, 2, 7, 9, 0, 3, 5, 6]

We'd move on to index position 1 and repeat the entire process.

After a set number of iterations, the burn-in period, we can start to record the p_v source environment counts as the percent contribution of each source to our sink sample. The delay determines how many iterations to skip between draws after the burn-in period. The draws_per_restart parameter indicates how many draws to save from each restart. All draws are then averaged together to produce the final mixing proportions.
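
As a final illustration, with hypothetical recorded draws from two restarts, the reported result is simply their mean:


In [ ]:
# hypothetical recorded draws (proportions for Source1, Source2, Unknown)
draws = np.array([[0.3, 0.2, 0.5],
                  [0.4, 0.2, 0.4]])
draws.mean(axis=0)  # final mixing proportions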